20 - Deep Learning - Regularization Part 4 [ID:15398]

Welcome back to deep learning. So today we want to look at a couple of initialization techniques

that will come in really handy throughout your work with deep learning networks.

And there is very little theory behind the best solutions that we have at the moment.

So you may wonder why does initialization matter? If you have a convex function actually it doesn't

matter at all because you follow the negative gradient direction and you will always find the

global minimum. So no problem for convex optimization. However many of the problems that

we are dealing with are non-convex, and a non-convex function may have several different local minima.

Now, if I start at this point, you can see that I reach one local minimum through the optimization.

But if I were to start at this other point, you can see that I would end up in a very different

local minimum. So for non-convex problems, initialization is actually a big deal, and neural

networks with non-linearities are in general non-convex. So what can be done? Well, of course

you have to work with some initialization. For the biases this is quite easy: you can simply

initialize them to zero, and this is very typical. Keep in mind that if you're working with ReLUs,

you may want to start with a small positive constant instead, because of the dying ReLU issue.

For the weights, you need random values in order to break the symmetry. We already saw this

problem with dropout, where we need additional regularization in order to break the symmetry,

and it would be especially bad to initialize the weights with zeros, because then the gradients

are zero as well. So this is something that you don't want to do, because it simply doesn't work.
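As a minimal sketch of what this could look like in practice (plain NumPy, with hypothetical layer sizes fan_in and fan_out chosen only for illustration):

    import numpy as np

    fan_in, fan_out = 256, 128   # hypothetical layer dimensions

    # Biases: zero is the typical default; a small positive constant
    # (here 0.01) can help avoid dying ReLUs.
    b = np.full(fan_out, 0.01)

    # Weights: small zero-mean Gaussian values break the symmetry.
    # The standard deviation 0.01 is an arbitrary small choice here;
    # how to calibrate it properly is discussed next.
    W = np.random.normal(loc=0.0, scale=0.01, size=(fan_out, fan_in))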

Similar to the learning rate, the variance of the weights influences the stability of the learning

process, and small uniform or Gaussian random values work. Now you may wonder how we can calibrate

those variances. Let's suppose we have a single linear neuron with weights W and inputs X, and

remember that the capital letters mark them as random variables. Then the output ŷ is the linear

combination of the respective inputs with the weights, plus some bias.

And now we are interested in the variance of ŷ. If we assume that W and X are independent, then the

variance of each product can actually be computed as the squared expected value of X times the

variance of W, plus the squared expected value of W times the variance of X, plus the product of

the two variances. Now, if W and X both have zero mean, the whole issue simplifies, because the

expectation terms vanish and the variance of the product is simply the product of the two

variances. Now we assume that the X_n and W_n are independent and identically distributed.
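Written out as formulas (a compact sketch of the calculation just described, with N denoting the number of inputs and a constant bias b):

    \begin{align*}
    \hat{y} &= \sum_{n=1}^{N} W_n X_n + b \\
    \operatorname{Var}(W_n X_n) &= \operatorname{E}[X_n]^2 \operatorname{Var}(W_n)
        + \operatorname{E}[W_n]^2 \operatorname{Var}(X_n)
        + \operatorname{Var}(W_n)\operatorname{Var}(X_n) \\
        &= \operatorname{Var}(W_n)\operatorname{Var}(X_n) \quad \text{(zero means)} \\
    \operatorname{Var}(\hat{y}) &= N \, \operatorname{Var}(W)\operatorname{Var}(X) \quad \text{(i.i.d. inputs and weights)}
    \end{align*}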

In this special case, we can then see that essentially the number of inputs N scales our variance.

So the output variance actually depends on the number of inputs to your layer, and it grows with N

times the product of the individual variances. So you see that the weights are very important, and

effectively the more inputs you have, the more the variance gets scaled up. As a result, we can

then work with Xavier initialization. Here, we calibrate the variances for the forward pass: we

initialize with a zero-mean Gaussian and simply set the variance to one over fan_in, i.e. the

standard deviation to the square root of one over fan_in, where fan_in is the input dimension of

the weights. So we simply scale the variance by one over the number of input dimensions. In the

backward pass, however, we would need the same effect backwards, so we would have to scale with

one over fan_out, where fan_out is the output dimension of the weights. So you just average those

two conditions and compute a new standard deviation, and this initialization is named after the

first author of reference 21.
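As a sketch (same NumPy setting and hypothetical dimensions as above), the averaged Xavier/Glorot variant could look like this:

    # Forward-pass condition:  Var(W) = 1 / fan_in
    # Backward-pass condition: Var(W) = 1 / fan_out
    # Xavier/Glorot compromise: average the two conditions.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    W = np.random.normal(0.0, std, size=(fan_out, fan_in))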

Well, what else can be done? There is also He initialization, which additionally takes into account

that the assumption of linear neurons is a problem. In reference 12, they showed that for ReLUs it

is better to actually use the square root of two over fan_in as the standard deviation. So this is

a very typical choice for initializing the weights randomly.
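A corresponding sketch for He initialization of a ReLU layer (again with the hypothetical dimensions from above):

    std = np.sqrt(2.0 / fan_in)   # He et al.: Var(W) = 2 / fan_in for ReLUs
    W = np.random.normal(0.0, std, size=(fan_out, fan_in))
    b = np.zeros(fan_out)         # or a small positive constant, as discussed earlier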

Other conventional choices are then that you use L2 regularization, dropout with a probability of

0.5 for fully connected layers (and only selectively in convolutional neural networks), mean

subtraction, batch normalization, and He initialization. So this is the very typical setup.

Okay, so what other tricks of the trade do we have left? One important

Part of a video series:

Access: Open Access

Duration: 00:10:00 min

Recording date: 2020-05-09

Uploaded on: 2020-05-10 00:16:05

Language: en-US

Deep Learning - Regularization Part 4

This video discusses initialization techniques and transfer learning.

Video References:
Lex Fridman's Channel

Further Reading:
A gentle Introduction to Deep Learning

Tags

initialization, backpropagation, artificial intelligence, deep learning, machine learning, pattern recognition, Feedforward Networks, transfer learning